Abstract. We draw a certain analogy between the classical information-theoretic
problem of lossy data compression (source coding) of memoryless information sources
and the statistical mechanical behavior of a certain model of a chain of connected
particles (e.g., a polymer) that is subjected to a contracting force. The free energy
difference pertaining to such a contraction turns out to be proportional to the
rate-distortion function in the analogous data compression model, and the contracting
force is proportional to the derivative of this function. Beyond the fact that this
analogy may be interesting in its own right, it may provide a physical perspective on
the behavior of optimum schemes for lossy data compression (and perhaps also an
information-theoretic perspective on certain physical system models). Moreover, it
triggers the derivation of lossy compression performance for systems with memory,
using analysis tools and insights from statistical mechanics.
1. Introduction

Relationships between information theory and statistical physics have been widely 
recognized in the last few decades, from a wide spectrum of aspects. These include 
conceptual aspects, of parallelisms and analogies between theoretical principles in 
the two disciplines, as well as technical aspects, of mapping between mathematical 
formalisms in both fields and borrowing analysis techniques from one field to the other. 
One example of such a mapping is between the paradigm of random codes for channel
coding and certain models of magnetic materials, most notably, Ising models and spin 
glass models (see, e.g., [1] , [H] , p3] , [H] , and many references therein). Today, it is 
quite widely believed that research in the intersection between information theory and 
statistical physics may have the potential of fertilizing both disciplines. 

This paper is more related to the former aspect mentioned above, namely, the
relationships between the two areas at the conceptual level. However, it also has
ingredients from the second aspect. In particular, let us consider two questions in the
two fields, which at first glance, may seem completely unrelated, but will nevertheless 
turn out later to be very related. These are special cases of more general questions that 
we study later in this paper. 

The first is a simple question in statistical mechanics, and it is about a certain
extension of a model described in (HI page 134, Problem 13]: Consider a one-dimensional
chain of $n$ connected elements (e.g., monomers or whatever basic units form a
polymer chain), arranged along a straight line (see Fig. 1), and residing in thermal
equilibrium at fixed temperature $T_0$. There are two types of elements, which will be
referred to as type '0' and type '1'. The number of elements of each type $x$ (with $x$
being either '0' or '1') is given by $n(x) = nP(x)$, where $P(0) + P(1) = 1$ (and so,
$n(0) + n(1) = n$). Each element of each type may be in one of two different states,
labeled by $\hat{x}$, where $\hat{x}$ also takes on the values '0' and '1'. The length and the
internal energy of an element of type $x$ at state $\hat{x}$ are given by $d(x, \hat{x})$ and
$\varepsilon(\hat{x})$ (independently of $x$), respectively. A contracting force $\lambda < 0$ is applied
to one edge of the chain while the other edge is fixed. What is the minimum amount of
mechanical work $W$ that must be carried out by this force, along an isothermal process
at temperature $T_0$, in order to shrink the chain from its original length $nD_0$ (when no
force was applied) into a shorter length, $nD$, where $D < D_0$ is a given constant?



Figure 1. A chain with various types of elements and various lengths (the elements in the drawing are labeled type '0' and type '1').



The second question is in information theory. In particular, it is the classical
problem of lossy source coding, and some of the notation here will deliberately be
chosen to be the same as before: An information source emits a string of $n$ independent
symbols, $x_1, x_2, \ldots, x_n$, where each $x_i$ may either be '0' or '1', with probabilities
$P(0)$ and $P(1)$, respectively. A lossy source encoder maps the source string,
$\mathbf{x} = (x_1, \ldots, x_n)$, into a shorter (compressed) representation of average length $nR$,
where $R$ is the coding rate (compression ratio), and the compatible decoder maps this
compressed representation into a reproduction string,
$\hat{\mathbf{x}} = (\hat{x}_1, \ldots, \hat{x}_n)$, where each $\hat{x}_i$ is again either '0' or '1'. The
fidelity of the reproduction is measured in terms of a certain distortion (or distance)
function, $d(\mathbf{x}, \hat{\mathbf{x}}) = \sum_{i=1}^n d(x_i, \hat{x}_i)$, which should be as small as possible,
so that $\hat{\mathbf{x}}$ would be as 'close' as possible to $\mathbf{x}$.‡ In the limit of large $n$, what is
the minimum coding rate $R = R(D)$ for which there exists an encoder and decoder such
that the average distortion, $\langle d(\mathbf{x}, \hat{\mathbf{x}}) \rangle$, would not exceed $nD$?

‡ For example, in lossless compression, $\hat{\mathbf{x}}$ is required to be strictly identical to $\mathbf{x}$, in which case
$d(\mathbf{x}, \hat{\mathbf{x}}) = 0$. However, in some applications, one might be willing to trade off between compression and
fidelity, i.e., slightly increase the distortion for the benefit of reducing the compression ratio $R$.

It turns out, as we shall see in the sequel, that the two questions have intimately
related answers. In particular, the minimum amount of work $W$, in the first question, is
related to $R(D)$ (a.k.a. the rate-distortion function) of the second question, according
to

$$W = nkT_0 R(D), \tag{1}$$

provided that the Hamiltonian, $\varepsilon(\hat{x})$, in the former problem, is given by

$$\varepsilon(\hat{x}) = -kT_0 \ln Q(\hat{x}), \tag{2}$$

where $k$ is Boltzmann's constant, and $Q(\hat{x})$ is the relative frequency (or the empirical
probability) of the symbol $\hat{x} \in \{0, 1\}$ in the reproduction sequence $\hat{\mathbf{x}}$, pertaining to
an optimum lossy encoder-decoder with average per-symbol distortion $D$ (for large $n$).
Moreover, the minimum amount of work $W$, which is simply the free energy difference
between the final equilibrium state and the initial state of the chain, is achieved by a
reversible process, where the compressing force $\lambda$ grows very slowly from zero, at the
beginning of the process, up to a final level of

$$\lambda = kT_0 R'(D), \tag{3}$$

where $R'(D)$ is the derivative of $R(D)$ (see Fig. 2). Thus, physical compression
is strongly related to data compression, and the fundamental physical limit on the
minimum required work is intimately connected to the fundamental
information-theoretic limit of the minimum required coding rate.

A statistical-mechanical view on source coding

Figure 2. Emulation of the rate-distortion function R(D) by a physical system (labels in the drawing: $W = nkT_0R(D)$, $nD$, $\lambda = kT_0R'(D)$, $nD_0$).

This link between the physical model and the lossy source coding problem is
obtained from a large deviations perspective. The exact details will be seen later on,
but in a nutshell, the idea is this: On the one hand, it is possible to represent $R(D)$
as the large deviations rate function of a certain rare event, but on the other hand,
this large deviations rate function involves the use of the Legendre transform, which
is a pivotal concept in thermodynamics and statistical mechanics. Moreover, since this
Legendre transform is applied to the (logarithm of the) moment generating function
(of the distortion variable), which in turn has the form of a partition function, this paves
the way to the above described analogy. The Legendre transform is associated with
the optimization across a certain parameter, which can be interpreted as either inverse
temperature (as was done, for example, in [9], [10], [15], [16]) or as a (generalized) force,
as proposed here. The interpretation of this parameter as force is somewhat more solid,
for reasons that will become apparent later.

One application of this analogy, between the two models, is a parametric 
representation of the rate-distortion function $R(D)$ as an integral of the minimum mean
square error (MMSE) in a certain Bayesian estimation problem, which is obtained in
analogy to a certain variant of the fluctuation-dissipation theorem. This representation
opens the door for derivation of upper and lower bounds on the rate-distortion function
via bounds on the MMSE, as was demonstrated in a companion paper [12].

Another possible application is demonstrated in the present paper: When the setup 
is extended to allow information sources with memory (non i.i.d. processes), then the 
analogous physical model consists of interactions between the various particles. When 
these interactions are sufficiently strong (and with high enough dimension), then the 
system exhibits phase transitions. In the information-theoretic domain, these phase 
transitions mean irregularities and threshold effects in the behavior of the relevant 
information-theoretic function, in this case, the rate-distortion function. Thus, analysis 
tools and physical insights are 'imported' from statistical mechanics to information
theory. A particular model example for this is worked out in Section 4.

The outline of the paper is as follows. In Section 2, we provide some relevant 
background in information theory, which may safely be skipped by readers that possess 
this background. In Section 3, we establish the analogy between lossy source coding and 
the above described physical model, and discuss it in detail. In Section 4, we demonstrate 
the analysis for a system with memory, as explained in the previous paragraph. Finally, 
in Section 5 we summarize and conclude. 

2. Information-theoretic background

2.1. General overview 

One of the most elementary roles of Information Theory is to provide fundamental 
performance limits pertaining to certain tasks of information processing, such as 
data compression, error-correction coding, encryption, data hiding, prediction, and 
detection/estimation of signals and/or parameters from noisy observations, just to name 
a few (see e.g., [3]). 

In this paper, our focus is on the first item mentioned - data compression, 
a.k.a. source coding, where the mission is to convert a piece of information (say, a 
long file), henceforth referred to as the source data, into a shorter (normally, binary) 
representation, which enables either perfect recovery of the original information, as in the 
case of lossless compression, or non-perfect recovery, where the level of reconstruction 
errors (or distortion) should remain within pre-specified limits, which is the case of lossy 
data compression. 

Lossless compression is possible whenever the statistical characterization of the 
source data inherently exhibits some level of redundancy that can be exploited by the 
compression scheme, for example, a binary file, where the relative frequency of 1's
is much larger than that of the 0's, or when there is a strong statistical dependence
between consecutive bits. These types of redundancy exist, more often than not, in
real-life situations. If some level of errors and distortion is allowed, as in the lossy
case, then compression can be made even more aggressive. The choice between lossless 
and lossy data compression depends on the application and the type of data to be 
compressed. For example, when it comes to sensitive information, like bank account 
information, or a piece of important text, then one may not tolerate any reconstruction 
errors at all. On the other hand, images and audio/video files, may suffer some degree 
of harmless reconstruction errors (which may be unnoticeable to the human eye or ear, 
if designed cleverly) and thus allow stronger compression, which would be very welcome, 
since images and video files are typically enormously large. The compression ratio, or 
the coding rate, denoted R, is defined as the (average) ratio between the length of the 
compressed file (in bits) and the length of the original file. 

The basic role of Information Theory, in the context of lossless/lossy source
coding, is to characterize the fundamental limits of compression: For a given statistical
characterization of the source data, normally modeled by a certain random process,
what is the minimum achievable compression ratio $R$ as a function of the allowed
average distortion, denoted $D$, which is defined with respect to some distortion function
that measures the degree of proximity between the source data and the recovered
data? The characterization of this minimum achievable $R$ for a given $D$, denoted as
a function $R(D)$, is called the rate-distortion function of the source with respect to
the prescribed distortion function. For the lossless case, of course, $D = 0$. Another
important question is how, in principle, one may achieve (or at least approach) this
fundamental limit of optimum performance, $R(D)$? In this context, there is a big gap
between lossy compression and lossless compression. While for the lossless case, there are
many practical algorithms (most notably, adaptive Huffman codes, Lempel-Ziv codes,
arithmetic codes, and more), in the lossy case, there is, unfortunately, no constructive
practical scheme whose performance comes close to $R(D)$.

2.2. The rate-distortion function

The simplest non-trivial model of an information source is that of an i.i.d. process, a.k.a.
a discrete memoryless source (DMS), where the source symbols, $x_1, x_2, \ldots, x_n$, take on
values in a common finite set (alphabet) $\mathcal{X}$, they are statistically independent, and they
are all drawn from the same probability mass function, denoted by $P = \{P(x),\ x \in \mathcal{X}\}$.
The source string $\mathbf{x} = (x_1, \ldots, x_n)$ is compressed into a binary representation§ of length
$\ell = \ell(\mathbf{x})$ (which may or may not depend on $\mathbf{x}$), whose average is $\langle \ell(\mathbf{x}) \rangle$, and the
compression ratio is $R = \langle \ell(\mathbf{x}) \rangle / n$. In the decoding (or decompression) process, the
compressed representation is mapped into a reproduction string
$\hat{\mathbf{x}} = (\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n)$, where each $\hat{x}_i$, $i = 1, 2, \ldots, n$, takes on values in
the reproduction alphabet $\hat{\mathcal{X}}$ (which is typically either equal to $\mathcal{X}$ or to a subset of
$\mathcal{X}$, but this is not necessary). The fidelity of the reconstruction string $\hat{\mathbf{x}}$ relative to
the original source string $\mathbf{x}$ is measured by a certain distortion function
$d_n(\mathbf{x}, \hat{\mathbf{x}})$, where the function $d_n$ is defined additively as
$d_n(\mathbf{x}, \hat{\mathbf{x}}) = \sum_{i=1}^n d(x_i, \hat{x}_i)$, $d(\cdot, \cdot)$ being a function from
$\mathcal{X} \times \hat{\mathcal{X}}$ to the non-negative reals.
The average distortion per symbol is $D = \langle d_n(\mathbf{x}, \hat{\mathbf{x}}) \rangle / n$.

§ It should be noted that in the case of variable-length coding, where $\ell = \ell(\mathbf{x})$ depends on $\mathbf{x}$, the
code should be designed such that the running bit-stream (formed by concatenating compressed strings
corresponding to successive $n$-blocks from the source) could be uniquely parsed in the correct manner
and then decoded. To this end, the lengths $\{\ell(\mathbf{x})\}$ must be collectively large enough so as to satisfy
the Kraft inequality. The details can be found, for example, in [3].

As said, $R(D)$ is defined (in general) as the infimum of all rates $R$ for which
there exist a sufficiently large $n$ and an encoder-decoder pair for $n$-blocks, such that
the average distortion per symbol would not exceed $D$. In the case of a DMS $P$,
an elementary coding theorem of Information Theory asserts that $R(D)$ admits the
following formula

$$R(D) = \min I(x; \hat{x}), \tag{4}$$

where $x$ is a random variable that represents a single source symbol (i.e., it is governed
by $P$), $I(x; \hat{x})$ is the mutual information between $x$ and $\hat{x}$, i.e.,

$$I(x; \hat{x}) = \sum_{x \in \mathcal{X}} \sum_{\hat{x} \in \hat{\mathcal{X}}} P(x) W(\hat{x}|x) \ln \frac{W(\hat{x}|x)}{Q(\hat{x})}, \tag{5}$$

$Q(\hat{x}) = \sum_{x \in \mathcal{X}} P(x) W(\hat{x}|x)$ being the marginal distribution of $\hat{x}$, which is associated with
a given conditional distribution $\{W(\hat{x}|x)\}$, and the minimum is over all these conditional
probability distributions for which

$$\langle d(x, \hat{x}) \rangle = \sum_{x \in \mathcal{X}} \sum_{\hat{x} \in \hat{\mathcal{X}}} P(x) W(\hat{x}|x) d(x, \hat{x}) \le D. \tag{6}$$

For $D = 0$, $\hat{x}$ must be equal to $x$ with probability one (unless $d(x, \hat{x}) = 0$ also for some
$\hat{x} \ne x$), and then

$$R(0) = I(x; x) = H \equiv -\sum_{x \in \mathcal{X}} P(x) \ln P(x), \tag{7}$$

the Shannon entropy of $x$, as expected. As mentioned earlier, there are concrete
compression algorithms that come close to $H$ for large $n$. For $D > 0$, however, the
proof of achievability of $R(D)$ is non-constructive.
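
As a concrete illustration of the rate-distortion function of a DMS, consider the standard textbook special case of a binary memoryless source with Hamming distortion, for which the minimization over conditional distributions is known in closed form: $R(D) = h(p) - h(D)$ for $0 \le D < \min\{p, 1-p\}$ and $R(D) = 0$ otherwise, where $p = P(1)$ and $h$ is the binary entropy function in nats. This special case is not worked out in the text; the following Python sketch merely evaluates it, and the function names are our own choices.

```python
import math

def binary_entropy(p):
    """Binary entropy h(p) in nats; h(0) = h(1) = 0 by continuity."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def rate_distortion_binary_hamming(p1, D):
    """R(D) in nats for a binary DMS with P(1) = p1 under Hamming distortion.

    Uses the textbook closed form h(p1) - h(D), valid for D < min(p1, 1 - p1).
    """
    if D >= min(p1, 1.0 - p1):
        # A constant reproduction string already meets the distortion constraint.
        return 0.0
    return binary_entropy(p1) - binary_entropy(D)

if __name__ == "__main__":
    for D in (0.0, 0.05, 0.1, 0.2, 0.3):
        rate = rate_distortion_binary_hamming(0.3, D)
        print(f"D = {D:.2f}   R(D) = {rate:.4f} nats")
```

At $D = 0$ the function returns the entropy $H$ of the source, in agreement with the lossless limit discussed above, and it decreases monotonically to zero as $D$ grows.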

2.3. Random coding 

The idea for proving the existence of a sequence of codes (indexed by $n$) whose
performance approaches $R(D)$ as $n \to \infty$ is based on the notion of random coding: If we
can define, for each $n$, an ensemble of codes of (fixed) rate $R$, for which the average
per-symbol distortion (across both the randomness of $\mathbf{x}$ and the randomness of the code)
is asymptotically less than or equal to $D$, then there must exist at least one sequence
of codes in that ensemble with this property. The idea of random coding is useful
because if the ensemble of codes is chosen wisely, the average ensemble performance is
surprisingly easy to derive (in contrast to the performance of a specific code) and can be
proven to meet $R(D)$ in the limit of large $n$.

For a given $n$, consider the following ensemble of codes: Let $W^*$ denote the
conditional probability matrix that achieves $R(D)$ and let $Q^*$ denote the corresponding
marginal distribution of $\hat{x}$. Consider now a random selection of $M = e^{nR}$ reproduction
strings, $\hat{\mathbf{x}}_1, \hat{\mathbf{x}}_2, \ldots, \hat{\mathbf{x}}_M$, each of length $n$, where each
$\hat{\mathbf{x}}_i = (\hat{x}_{i,1}, \hat{x}_{i,2}, \ldots, \hat{x}_{i,n})$, $i = 1, 2, \ldots, M$, is drawn independently
(of all other reproduction strings), according to

$$Q^*(\hat{\mathbf{x}}_i) = \prod_{j=1}^n Q^*(\hat{x}_{i,j}). \tag{8}$$

This randomly chosen code is generated only once and then revealed to the decoder.
Upon observing an incoming source string $\mathbf{x}$, the encoder seeks the first reproduction
string $\hat{\mathbf{x}}_i$ that achieves $d_n(\mathbf{x}, \hat{\mathbf{x}}_i) \le nD$, and then transmits its index $i$ using
$\log_2 M = nR \log_2 e$ bits, or equivalently, $\ln M = nR$ nats.‖ If no such codeword exists, which
is referred to as the event of encoding failure, the encoder sends an arbitrary sequence
of $nR$ nats, say, the all-zero sequence. The decoder receives the index $i$ and simply
outputs the corresponding reproduction string $\hat{\mathbf{x}}_i$.

‖ While $\log_2 M$ has the obvious interpretation of the number of bits needed to specify a number
between 1 and $M$, the natural base logarithm is often mathematically more convenient to work with.
The quantity $\ln M$ can also be thought of as the description length, but in different units, called nats,
rather than bits, where the conversion is according to 1 nat = $\log_2 e$ bits.

Obviously, the per-symbol distortion would not exceed $D$ whenever the encoder
does not fail, and so, the main point of the proof is to show that the probability of
failure (across the randomness of $\mathbf{x}$ and the ensemble of codes) is vanishingly small
for large $n$, provided that $R$ is slightly larger than (but can be arbitrarily close to)
$R(D)$, i.e., $R = R(D) + \epsilon$ for an arbitrarily small $\epsilon > 0$. The idea is that for any
source string that is typical to $P$ (i.e., the empirical relative frequency of each symbol
in $\mathbf{x}$ is close to its probability), one can show (see, e.g., [3]) that the probability that
a single, randomly selected reproduction string $\hat{\mathbf{x}}$ would satisfy
$d_n(\mathbf{x}, \hat{\mathbf{x}}) \le nD$ decays exponentially as $\exp[-nR(D)]$. Thus, the above described
random selection of the entire codebook, together with the encoding operation, is
equivalent to conducting $M$ independent trials in the quest for having at least one $i$ for
which $d_n(\mathbf{x}, \hat{\mathbf{x}}_i) \le nD$, $i = 1, 2, \ldots, M$. If $M = e^{n[R(D)+\epsilon]}$, the number of
trials is much larger (by a factor of $e^{n\epsilon}$) than the reciprocal of the probability of a
single 'success', $\exp[-nR(D)]$, and so, the probability of obtaining at least one such
success (which is the case where the encoder succeeds) tends to unity as $n \to \infty$. We took
the liberty of assuming that the source string is typical to $P$ because the probability of
seeing a non-typical string is vanishingly small.
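
The counting argument above can be checked numerically with a small Python sketch. It is only a caricature of the proof: it assumes, purely for illustration, that each of the $M = e^{n[R(D)+\epsilon]}$ trials succeeds independently with probability exactly $\exp[-nR(D)]$ (ignoring sub-exponential factors and the fact that $M$ must be an integer), so that the failure probability is $(1 - e^{-nR(D)})^M$. The function name and the numerical values of $R(D)$ and $\epsilon$ are hypothetical.

```python
import math

def encoding_failure_probability(n, rate_d, eps):
    """Probability that all M = exp(n*(rate_d + eps)) independent trials fail,
    when each trial succeeds with probability exp(-n*rate_d).

    Assumes n >= 1 and rate_d > 0, so the success probability is below one.
    """
    p_success = math.exp(-n * rate_d)
    num_trials = math.exp(n * (rate_d + eps))
    # (1 - p)^M evaluated as exp(M * log(1 - p)); log1p keeps precision for tiny p.
    return math.exp(num_trials * math.log1p(-p_success))

if __name__ == "__main__":
    for n in (10, 20, 50):
        print(n, encoding_failure_probability(n, rate_d=0.3, eps=0.1))
```

Under these assumptions the failure probability behaves like $\exp(-e^{n\epsilon})$, i.e., it vanishes doubly exponentially in $n$, which is why an arbitrarily small $\epsilon > 0$ suffices.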

2.4. The large deviations perspective

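From the foregoing discussion, we see that $R(D)$ has the additional interpretation of the
exponential rate of the probability of the event $d_n(\mathbf{x}, \hat{\mathbf{x}}) \le nD$, where $\mathbf{x}$ is a given string
typical to $P$ and $\hat{\mathbf{x}}$ is randomly drawn i.i.d. under $Q^*$. Consider the following chain of
equalities and inequalities for bounding the probability of this event from above. Letting
$s$ be a parameter taking an arbitrary non-positive value, we have:

$$
\begin{aligned}
\Pr\{d_n(\mathbf{x}, \hat{\mathbf{x}}) \le nD\}
&= \Pr\left\{\sum_{i=1}^n d(x_i, \hat{x}_i) \le nD\right\} \\
&\le \left\langle \exp\left\{s\left[\sum_{i=1}^n d(x_i, \hat{x}_i) - nD\right]\right\} \right\rangle \\
&= e^{-nsD} \prod_{i=1}^n \left\langle e^{s d(x_i, \hat{x}_i)} \right\rangle \\
&= e^{-nsD} \prod_{x \in \mathcal{X}} \prod_{i:\, x_i = x} \left\langle e^{s d(x, \hat{x}_i)} \right\rangle \\
&= e^{-nsD} \prod_{x \in \mathcal{X}} \left[\sum_{\hat{x} \in \hat{\mathcal{X}}} Q^*(\hat{x}) e^{s d(x, \hat{x})}\right]^{nP(x)} \\
&= e^{-nI(D,s)},
\end{aligned} \tag{9}
$$

where $I(D, s)$ is defined as

$$I(D, s) = sD - \sum_{x \in \mathcal{X}} P(x) \ln\left[\sum_{\hat{x} \in \hat{\mathcal{X}}} Q^*(\hat{x}) e^{s d(x, \hat{x})}\right]. \tag{10}$$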



The tightest upper bound is obtained by minimizing it over the range $s \le 0$, which
is equivalent to maximizing $I(D, s)$ in that range. I.e., the tightest upper bound of
this form is $e^{-nI(D)}$, where $I(D) = \sup_{s \le 0} I(D, s)$ (the Chernoff bound). While this
is merely an upper bound, the methods of large deviations theory (see, e.g., [I]) can
readily be used to establish the fact that the bound $e^{-nI(D)}$ is tight in the exponential
sense, namely, it is the correct asymptotic exponential decay rate of $\Pr\{d_n(\mathbf{x}, \hat{\mathbf{x}}) \le nD\}$.
Accordingly, $I(D)$ is called the large deviations rate function of this event. Combining
this with the foregoing discussion, it follows that $R(D) = I(D)$, which means that an
alternative expression of $R(D)$ is given by

$$R(D) = \sup_{s \le 0}\left[sD - \sum_{x \in \mathcal{X}} P(x) \ln\left(\sum_{\hat{x} \in \hat{\mathcal{X}}} Q^*(\hat{x}) e^{s d(x, \hat{x})}\right)\right]. \tag{11}$$

Interestingly, the same expression was obtained in [5, Corollary 4.2.3] using completely
different considerations (see also [15]). In this paper, however, we will also concern
ourselves, more generally, with the rate-distortion function, $R_Q(D)$, pertaining to a
given reproduction distribution $Q$, which may not necessarily be the optimum one, $Q^*$.
This function is defined similarly as in eq. (4), but with the additional constraint that
the marginal distribution that represents the reproduction would agree with the given $Q$,
i.e., $\sum_x P(x) W(\hat{x}|x) = Q(\hat{x})$. By using the same large deviations arguments as above,
but for an arbitrary random coding distribution $Q$, one readily observes that $R_Q(D)$
is of the same form as in eq. (11), except that $Q^*$ is replaced by the given $Q$ (see also
[12]). This expression will now be used as a bridge to the realm of equilibrium statistical
mechanics.
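
The variational expression for the rate function, i.e., the supremum over $s \le 0$ of $sD - \sum_x P(x) \ln \sum_{\hat{x}} Q(\hat{x}) e^{s d(x, \hat{x})}$, is easy to evaluate numerically because the objective is concave in $s$. The Python sketch below does so by a ternary search on a bounded interval $[s_{\min}, 0]$; the function name, the search interval and the test case are our own assumptions. For the symmetric binary source with Hamming distortion, the optimum reproduction distribution is uniform (a standard fact), so the result can be compared with the closed form $\ln 2 - h(D)$.

```python
import math

def chernoff_rate(P, Q, d, D, s_min=-60.0, iterations=200):
    """Maximise s*D - sum_x P[x]*ln(sum_y Q[y]*exp(s*d[x][y])) over s in [s_min, 0].

    Returns (maximum value in nats, maximising s). The objective is concave in s,
    so a ternary search on the bounded interval suffices. The bound s_min is an
    assumption; it must be low enough to contain the maximiser.
    """
    def objective(s):
        value = s * D
        for x, px in enumerate(P):
            z = sum(qy * math.exp(s * d[x][y]) for y, qy in enumerate(Q))
            value -= px * math.log(z)
        return value

    lo, hi = s_min, 0.0
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1   # the maximiser lies to the right of m1
        else:
            hi = m2   # the maximiser lies to the left of m2
    s_star = 0.5 * (lo + hi)
    return objective(s_star), s_star

if __name__ == "__main__":
    P = [0.5, 0.5]
    Q = [0.5, 0.5]
    hamming = [[0, 1], [1, 0]]
    value, s_star = chernoff_rate(P, Q, hamming, D=0.1)
    closed_form = math.log(2) + 0.1 * math.log(0.1) + 0.9 * math.log(0.9)
    print(value, closed_form, s_star)
```

In this example the maximizing parameter is $s = \ln[D/(1-D)]$, which is negative for $D < 1/2$, in line with the restriction to non-positive $s$.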



3. Statistical mechanics of source coding 



Consider the parametric representation of the rate-distortion function $R_Q(D)$, with
respect to a given reproduction distribution $Q$:

$$R_Q(D) = \sup_{s \le 0}\left[sD - \sum_{x \in \mathcal{X}} P(x) \ln\left(\sum_{\hat{x} \in \hat{\mathcal{X}}} Q(\hat{x}) e^{s d(x, \hat{x})}\right)\right]. \tag{12}$$

The expression in the inner brackets,

$$Z_x(s) = \sum_{\hat{x} \in \hat{\mathcal{X}}} Q(\hat{x}) e^{s d(x, \hat{x})}, \tag{13}$$

can be thought of as the partition function of a single particle of "type" $x$, which
is defined as follows. Assuming a certain fixed temperature $T = T_0$, consider the
Hamiltonian

$$\varepsilon(\hat{x}) = -kT_0 \ln Q(\hat{x}). \tag{14}$$
Imagine now that this particle may be in various states, indexed by $\hat{x} \in \hat{\mathcal{X}}$. When a
particle of type $x$ lies in state $\hat{x}$, its internal energy is $\varepsilon(\hat{x})$, as defined above, and its
length is $d(x, \hat{x})$. Next, assume that instead of working with the parameter $s$, we rescale
and redefine the free parameter as $\lambda$, where $s = \lambda/(kT_0)$. Then, $\lambda$ has the physical
meaning of a force that is conjugate to the length. This force is stretching for $\lambda > 0$ and
contracting for $\lambda < 0$. With a slight abuse of notation, the Gibbs partition function [7,
Section 4.8] pertaining to a single particle of type $x$ is then given by

$$Z_x(\lambda) = \sum_{\hat{x} \in \hat{\mathcal{X}}} \exp\left\{-\frac{1}{kT_0}\left[\varepsilon(\hat{x}) - \lambda d(x, \hat{x})\right]\right\}, \tag{15}$$

and accordingly,

$$G_x(\lambda) = -kT_0 \ln Z_x(\lambda) \tag{16}$$

is the Gibbs free energy per particle of type $x$. Thus,

$$G(\lambda) = \sum_{x \in \mathcal{X}} P(x) G_x(\lambda) \tag{17}$$

is the average per-particle Gibbs free energy (or the Gibbs free energy density)
pertaining to a system with a total of $n$ non-interacting particles, from $|\mathcal{X}|$ different
types, where the number of particles of type $x$ is $nP(x)$, $x \in \mathcal{X}$. The Helmholtz free
energy per particle is then given by the Legendre transform

$$F(D) = \sup_{\lambda}[G(\lambda) + \lambda D]. \tag{18}$$

However, for $D < D_0 \equiv \sum_{x \in \mathcal{X}} \sum_{\hat{x} \in \hat{\mathcal{X}}} P(x) Q(\hat{x}) d(x, \hat{x})$ (which is the interesting range,
where $R_Q(D) > 0$), the maximizing $\lambda$ is always non-positive, and so,

$$F(D) = \sup_{\lambda \le 0}[G(\lambda) + \lambda D]. \tag{19}$$

Invoking now eq. (12), we readily identify that

$$F(D) = kT_0 R_Q(D), \tag{20}$$

which supports the analogy between the lossy data compression problem and the
behavior of the statistical-mechanical model of the kind described in the third paragraph
of the Introduction: According to this model, the physical system under discussion is
a long chain with a total of $n$ elements, which is composed of $|\mathcal{X}|$ different types of
shorter chains (indexed by $x$), where the number of elements in the short chain of type
$x$ is $nP(x)$, and where each element of each chain can be in various states, indexed by $\hat{x}$.
In each state $\hat{x}$, the internal energy and the length of each element are $\varepsilon(\hat{x})$ and $d(x, \hat{x})$,
as described above. The total length of the chain, when no force is applied, is therefore
$\sum_{i=1}^n \langle d(x_i, \hat{x}_i) \rangle |_{\lambda=0} = nD_0$. Upon applying a contracting force $\lambda < 0$, states of shorter
length become more probable, and the chain shrinks to the length of $nD$, where $D$ is
related to $\lambda$ according to the Legendre relation (18)¶ between $F(D)$ and $G(\lambda)$, which
is given by

$$\lambda = F'(D) = kT_0 R'_Q(D), \tag{21}$$

where $F'(D)$ and $R'_Q(D)$ are, respectively, the derivatives of $F(D)$ and $R_Q(D)$ relative
to $D$. The inverse relation is, of course,

$$D = -G'(\lambda), \tag{22}$$

where $G'(\lambda)$ is the derivative of $G(\lambda)$. Since $R_Q(D)$ is proportional to the free
energy, where the system is held in equilibrium at length $nD$, it also means the minimum
amount of work required in order to shrink the system from length $nD_0$ to length $nD$,
and this minimum is obtained by a reversible process of slow increase in $\lambda$, starting from
zero and ending at the final value given by eq. (21).

¶ Since $G(\lambda)$ is concave and $F(D)$ is convex, the inverse Legendre transform holds as well, and so,
there is a one-to-one correspondence between $\lambda$ and $D$.
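
The physical side of the analogy can also be checked numerically. The Python sketch below evaluates the per-particle Gibbs free energy of the chain model, the equilibrium length per particle as a Boltzmann average, and the resulting Helmholtz free energy, and compares the latter with $kT_0$ times the rate function for the symmetric binary example with Hamming "lengths" and uniform $Q$, where $R_Q(D) = \ln 2 - h(D)$. The function names, the argument `kT` (standing for the product $kT_0$ in arbitrary energy units) and the numerical values are assumptions made for illustration only.

```python
import math

def gibbs_free_energy(lam, P, Q, d, kT):
    """Per-particle Gibbs free energy: sum_x P[x] * (-kT * ln Z_x(lam))."""
    g = 0.0
    for x, px in enumerate(P):
        z = 0.0
        for y, qy in enumerate(Q):
            energy = -kT * math.log(qy)                    # Hamiltonian of state y
            z += math.exp(-(energy - lam * d[x][y]) / kT)  # Boltzmann weight
        g += px * (-kT * math.log(z))
    return g

def mean_length(lam, P, Q, d, kT):
    """Equilibrium length per particle under force lam (Boltzmann average of d)."""
    total = 0.0
    for x, px in enumerate(P):
        weights = [qy * math.exp(lam * d[x][y] / kT) for y, qy in enumerate(Q)]
        z = sum(weights)
        total += px * sum(w * d[x][y] for y, w in enumerate(weights)) / z
    return total

def helmholtz_free_energy(lam, P, Q, d, kT):
    """G(lam) + lam * D, evaluated at the length D that the force lam produces."""
    return gibbs_free_energy(lam, P, Q, d, kT) + lam * mean_length(lam, P, Q, d, kT)

if __name__ == "__main__":
    P, Q = [0.5, 0.5], [0.5, 0.5]
    lengths = [[0, 1], [1, 0]]
    kT = 2.5                                  # hypothetical value of k*T0
    lam = kT * math.log(0.1 / 0.9)            # contracting force that yields D = 0.1
    print(mean_length(0.0, P, Q, lengths, kT))    # length per particle with no force
    print(mean_length(lam, P, Q, lengths, kT))    # length per particle under the force
    print(helmholtz_free_energy(lam, P, Q, lengths, kT))
```

With no force the length per particle is $D_0 = 1/2$ in this example, and under the contracting force it shrinks to the target $D$, while the free energy per particle agrees with $kT_0[\ln 2 - h(D)]$.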

Discussion 

This analogy between the lossy source coding problem and the statistical-mechanical
model of a chain may suggest that physical insights may shed light on lossy source
coding and vice versa. We learn, for example, that the contribution of each source
symbol $x$ to the distortion, $\sum_{i:\, x_i = x} d(x_i, \hat{x}_i)$, is analogous to the length contributed by
the chain of type $x$ when the contracting force $\lambda$ is applied. We have also learned that
the local slope of $R_Q(D)$ is proportional to a force, which must increase as the chain is
contracted more and more aggressively, and near $D = 0$, it normally tends to infinity,
as $R'_Q(0) = -\infty$ in most cases. This slope parameter also plays a pivotal role in the theory
and practice of lossy source coding: On the theoretical side, it gives rise to a variety of
parametric representations of the rate-distortion function [2], [5], some of which support
the derivation of important, non-trivial bounds. On the more practical side, data
compression schemes are often designed by optimizing an objective function with the
structure of

rate + $\lambda$ · distortion,

thus $\lambda$ plays the role of a Lagrange multiplier. This Lagrange multiplier is now
understood to act like a physical force, which can be 'tuned' to the desired trade-off
between rate and distortion. As yet another example, the convexity of the
rate-distortion function can be understood from a physical point of view, as the Helmholtz
free energy is also convex, a fact which has a physical explanation (related to the
fluctuation-dissipation theorem), in addition to the mathematical one.
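
The Lagrangian viewpoint can be illustrated with a small Python sketch, under the following assumptions of ours: the source is binary and symmetric with Hamming distortion, so that $R(D) = \ln 2 - h(D)$ for $0 \le D \le 1/2$, and the multiplier is taken as a positive number $\mu$ (playing the role of $-s$ in the text). Minimizing rate + $\mu$ · distortion over a grid of distortion levels picks out the point where the slope of $R(D)$ equals $-\mu$, i.e., $D = 1/(1 + e^{\mu})$. The function names are hypothetical.

```python
import math

def binary_entropy(p):
    """Binary entropy h(p) in nats; h(0) = h(1) = 0 by continuity."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def best_operating_point(mu, grid_size=5000):
    """Minimise R(D) + mu*D over a uniform grid of D in (0, 1/2].

    R(D) = ln 2 - h(D) is the rate-distortion function of the symmetric binary
    source with Hamming distortion. Returns (minimising D, minimum cost).
    """
    best_d, best_cost = None, float("inf")
    for k in range(1, grid_size + 1):
        dist = 0.5 * k / grid_size
        cost = math.log(2) - binary_entropy(dist) + mu * dist
        if cost < best_cost:
            best_d, best_cost = dist, cost
    return best_d, best_cost

if __name__ == "__main__":
    for mu in (0.5, 1.0, 2.0, 4.0):
        dist, cost = best_operating_point(mu)
        print(f"mu = {mu:.1f}   D = {dist:.4f}   slope-matching D = {1.0 / (1.0 + math.exp(mu)):.4f}")
```

A larger multiplier, i.e., a stronger 'force', drives the selected operating point towards smaller distortion and higher rate.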
At this point, two technical comments are in order: 

(i) We emphasized the fact that the reproduction distribution $Q$ is fixed. For a given
target value of $D$, one may, of course, have the freedom to select the optimum
distribution $Q^*$ that minimizes $R_Q(D)$, which would yield the rate-distortion
function, $R(D)$, and so, in principle, all the foregoing discussion applies to $R(D)$
as well. Some caution, however, must be exercised here, because in general, the
optimum $Q$ may depend on $D$ (or equivalently, on $s$ or $\lambda$), which means that in
the analogous physical model, the internal energy $\varepsilon(\hat{x})$ depends on the force $\lambda$ (in
addition to the linear dependence of the term $\lambda d(x, \hat{x})$). This kind of dependence
does not support the above described analogy in a natural manner. This is the
reason that we have defined the rate-distortion problem for a fixed $Q$, as it avoids
this problem. Thus, even if we pick the optimum $Q^*$ for a given target distortion
level $D$, then this $Q^*$ must be kept unaltered throughout the entire process of
increasing $\lambda$ from zero to its final value, given by (21), although $Q^*$ may be
sub-optimum for all intermediate distortion values that are met along the way from $D_0$
to $D$.

(ii) An alternative interpretation of the parameter $s$, in the partition function $Z_x(s)$,
could be the (negative) inverse temperature, as was suggested in [15] (see also [10]).
In this case, $d(x, \hat{x})$ would be the internal energy of an element of type $x$ at state
$\hat{x}$ and $Q(\hat{x})$, which does not include a power of $s$, could be understood as being
proportional to the degeneracy (in some coarse-graining process). In this case, the
distortion would have the meaning of internal energy, and since no mechanical work
is involved, this would also be the heat absorbed in the system, whereas $R_Q(D)$
would be related to the entropy of the system. The Legendre transform, in this case,
is the one pertaining to the passage between the microcanonical ensemble and the
canonical one. The advantage of the interpretation of $s$ (or $\lambda$) as force, as proposed
here, is that it lends itself naturally to a more general case, where there is more
than one fidelity criterion. For example, suppose there are two fidelity criteria,
with distortion functions $d$ and $d'$. Here, there would be two conjugate forces, $\lambda$
and $\lambda'$, respectively (for example, a mechanical force and a magnetic force), and the
physical analogy carries over. On the other hand, this would not work naturally
with the temperature interpretation approach since there is only one temperature
parameter in physics.

We end this section by providing a representation of $R_Q(D)$ and $D$ in an integral
form, which follows as a simple consequence of its representation as the Legendre
transform of $\ln Z_x(s)$, as in eq. (12). Since the maximization problem in (12) is a
convex problem ($\ln Z_x(s)$ is convex in $s$), the maximizing $s$ for a given $D$ is obtained by
taking the derivative of the r.h.s., which leads to

$$D = \sum_{x \in \mathcal{X}} P(x) \frac{\partial \ln Z_x(s)}{\partial s} = \sum_{x \in \mathcal{X}} P(x) \cdot \frac{\sum_{\hat{x} \in \hat{\mathcal{X}}} Q(\hat{x}) d(x, \hat{x}) e^{s d(x, \hat{x})}}{\sum_{\hat{x} \in \hat{\mathcal{X}}} Q(\hat{x}) e^{s d(x, \hat{x})}}. \tag{23}$$

This equation yields the distortion level $D$ for a given value of the maximizing $s$ in eq.
(12). Let us then denote

$$D_s \equiv \sum_{x \in \mathcal{X}} P(x) \cdot \frac{\sum_{\hat{x} \in \hat{\mathcal{X}}} Q(\hat{x}) d(x, \hat{x}) e^{s d(x, \hat{x})}}{\sum_{\hat{x} \in \hat{\mathcal{X}}} Q(\hat{x}) e^{s d(x, \hat{x})}}, \tag{24}$$

which means that

$$R_Q(D_s) = sD_s - \sum_{x \in \mathcal{X}} P(x) \ln Z_x(s). \tag{25}$$



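Taking the derivative of (24), we readily obtain

$$
\begin{aligned}
\frac{dD_s}{ds}
&= \sum_{x \in \mathcal{X}} P(x) \left[\frac{\sum_{\hat{x}} Q(\hat{x}) d^2(x, \hat{x}) e^{s d(x, \hat{x})}}{\sum_{\hat{x}} Q(\hat{x}) e^{s d(x, \hat{x})}} - \left(\frac{\sum_{\hat{x}} Q(\hat{x}) d(x, \hat{x}) e^{s d(x, \hat{x})}}{\sum_{\hat{x}} Q(\hat{x}) e^{s d(x, \hat{x})}}\right)^2\right] \\
&= \sum_{x \in \mathcal{X}} P(x) \cdot \mathrm{Var}_s\{d(x, \hat{x})|x\} \\
&= \mathrm{mmse}_s\{d(x, \hat{x})|x\},
\end{aligned} \tag{26}
$$

where $\mathrm{Var}_s\{d(x, \hat{x})|x\}$ is the variance of $d(x, \hat{x})$ w.r.t. the conditional probability
distribution

$$W_s(\hat{x}|x) = \frac{Q(\hat{x}) e^{s d(x, \hat{x})}}{\sum_{\hat{x}' \in \hat{\mathcal{X}}} Q(\hat{x}') e^{s d(x, \hat{x}')}}. \tag{27}$$

The last line of eq. (26) means that the expectation of $\mathrm{Var}_s\{d(x, \hat{x})|x\}$ w.r.t. $P$ is exactly
the MMSE of estimating $d(x, \hat{x})$ based on the 'observation' $x$ using the conditional mean
of $d(x, \hat{x})$ given $x$ as an estimator. Differentiating both sides of eq. (25), we get

$$
\begin{aligned}
\frac{dR_Q(D_s)}{ds}
&= s \cdot \frac{dD_s}{ds} + D_s - \sum_{x \in \mathcal{X}} P(x) \frac{\partial \ln Z_x(s)}{\partial s} \\
&= s \cdot \mathrm{mmse}_s\{d(x, \hat{x})|x\} + D_s - D_s \\
&= s \cdot \mathrm{mmse}_s\{d(x, \hat{x})|x\},
\end{aligned} \tag{28}
$$

or, equivalently,

$$R_Q(D_s) = \int_0^s s' \cdot \mathrm{mmse}_{s'}\{d(x, \hat{x})|x\}\, ds', \tag{29}$$

and

$$D_s = D_0 + \int_0^s \mathrm{mmse}_{s'}\{d(x, \hat{x})|x\}\, ds'. \tag{30}$$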



In [12], this representation was studied extensively and was found quite useful. In 
particular, simple bounds on the MMSE were shown to yield non-trivial bounds on 
the rate-distortion function in some cases where an exact closed form expression is 
unavailable. The physical analogue of this representation is the fluctuation-dissipation 
theorem, where the conditional variance, or equivalently the MMSE, plays the role of 
the fluctuation, which describes the sensitivity, or the linear response, of the length of 
the system to a small perturbation in the contracting force. If s is interpreted as the 
negative inverse temperature, as was mentioned before, then the MMSE is related to 
the specific heat of the system. 
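
As a numerical illustration of eqs. (24)-(30), the following minimal sketch checks the integral representation on a toy example. The binary alphabets, the values of P and Q, the Hamming distortion matrix and the value s = -2 are all assumptions chosen for illustration and do not come from the paper. The integrals are taken from s' = 0, where D_0 is the value of (24) at s = 0 and R_Q(D_0) = 0 by (25).

```python
import numpy as np

# Hypothetical toy setup (illustrative values, not from the paper).
P = np.array([0.3, 0.7])            # source distribution P(x)
Q = np.array([0.6, 0.4])            # random coding distribution Q(x_hat)
d = np.array([[0.0, 1.0],
              [1.0, 0.0]])          # Hamming distortion d[x, x_hat]

def moments(s):
    """Per source symbol x: ln Z_x(s), mean and variance of d under W_s(.|x)."""
    w = Q[None, :] * np.exp(s * d)   # unnormalized W_s(x_hat | x), eq. (27)
    Z = w.sum(axis=1)
    W = w / Z[:, None]
    mean = (W * d).sum(axis=1)
    var = (W * d ** 2).sum(axis=1) - mean ** 2
    return np.log(Z), mean, var

def D_of(s):
    """Distortion level D_s of eq. (24)."""
    return float(P @ moments(s)[1])

def mmse(s):
    """P-average of the conditional variance, last line of eq. (26)."""
    return float(P @ moments(s)[2])

def R_of(s):
    """R_Q(D_s) via the Legendre form of eq. (25)."""
    return s * D_of(s) - float(P @ moments(s)[0])

def trapezoid(y, x):
    """Signed trapezoidal rule (x may be decreasing)."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

s = -2.0
grid = np.linspace(0.0, s, 4001)     # integration path from 0 down to s
m = np.array([mmse(t) for t in grid])

R_int = trapezoid(grid * m, grid)            # r.h.s. of eq. (29)
D_int = D_of(0.0) + trapezoid(m, grid)       # r.h.s. of eq. (30)

print("R_Q(D_s): Legendre form", R_of(s), " integral form", R_int)
print("D_s:      direct form  ", D_of(s), " integral form", D_int)
```

The two forms should agree up to the quadrature error of the trapezoidal rule; a finer grid tightens the agreement.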



4. Sources with memory and interacting particles 

The theoretical framework established in the previous section extends, in principle, to 
information sources with memory (non i.i.d. sources), with a natural correspondence 
to a physical system of interacting particles. While the rate-distortion function for 
a general source with memory is unknown, the maximum rate achievable by random 
coding can still be derived in many cases of interest. Unlike the case of the memoryless 
source, where the best random coding distribution is memoryless as well, when the 
source exhibits memory, there is no apparent reason to believe that good random coding 
distributions should remain memoryless either, but it is not known what the form of the 
optimum random coding distribution is. For example, there is no theorem that asserts 
that the optimum random coding distribution for a Markov source is Markov too. One 
can, however, examine various forms of random coding distributions and compare 
them. Intuitively, the stronger the memory of the source, the stronger should be the 
memory of the random coding distribution. 

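In this section, we demonstrate one family of random coding distributions, with a 
very strong memory, which is inspired by the Curie-Weiss model of spin arrays, that 
possesses long range interactions. Consider the random coding distribution 

$$
Q(\hat{\mathbf{x}}) = \frac{1}{Z_n(B,J)} \exp\left\{ B \sum_{i=1}^{n} \hat{x}_i + \frac{J}{2n} \left( \sum_{i=1}^{n} \hat{x}_i \right)^{2} \right\}, \tag{31}
$$

where $\hat{\mathcal{X}} = \{-1,+1\}$, $B$ and $J$ are parameters, and $Z_n(B,J)$ is the appropriate 
normalization constant. Using the identity 

$$
\exp\left\{ \frac{J}{2n} \left( \sum_{i=1}^{n} \hat{x}_i \right)^{2} \right\}
 = \sqrt{\frac{n}{2\pi J}} \int_{-\infty}^{+\infty} d\theta\, \exp\left\{ -\frac{n\theta^{2}}{2J} + \theta \sum_{i=1}^{n} \hat{x}_i \right\},
$$

we can represent $Q$ as a mixture of i.i.d. distributions as follows: 

$$
Q(\hat{\mathbf{x}}) = \frac{1}{Z_n(B,J)} \sqrt{\frac{n}{2\pi J}} \int_{-\infty}^{+\infty} d\theta\, \exp\left\{ -\frac{n\theta^{2}}{2J} + (B+\theta) \sum_{i=1}^{n} \hat{x}_i \right\} \tag{32}
$$

$$
\phantom{Q(\hat{\mathbf{x}})} = \int_{-\infty}^{+\infty} d\theta\, \pi_n(\theta)\, Q_\theta(\hat{\mathbf{x}}), \tag{33}
$$

where $Q_\theta$ is the memoryless source 

$$
Q_\theta(\hat{\mathbf{x}}) = \frac{\exp\left\{ (B+\theta) \sum_{i=1}^{n} \hat{x}_i \right\}}{[2\cosh(B+\theta)]^{n}} \tag{34}
$$

and the weighting function $\pi_n(\theta)$ is given by 

$$
\pi_n(\theta) = \frac{1}{Z_n(B,J)} \sqrt{\frac{n}{2\pi J}} \exp\left\{ -n \left[ \frac{\theta^{2}}{2J} - \ln[2\cosh(B+\theta)] \right] \right\}. \tag{35}
$$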
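
The mixture representation (32)-(35) can be checked numerically for a short block length. The sketch below is only an illustration: the values n = 6, B = 0.1, J = 1.2 and the integration grid are assumptions, not taken from the paper. It uses the fact that the Curie-Weiss probability of a sequence depends on the sequence only through the sum m of its symbols, and replaces the integral over θ by a Riemann sum on a wide grid.

```python
import numpy as np
from math import comb

n, B, J = 6, 0.1, 1.2     # hypothetical small example values

def direct_weight(m):
    """Unnormalized Q of eq. (31) for a sequence whose symbols sum to m."""
    return np.exp(B * m + J * m ** 2 / (2 * n))

# Z_n(B, J): sum over all 2^n sequences, grouped by the number k of +1 symbols.
sums = [2 * k - n for k in range(n + 1)]
Z = sum(comb(n, k) * direct_weight(m) for k, m in enumerate(sums))

# Grid for the integral over theta; the integrand is negligible at the ends.
theta = np.linspace(-12.0, 12.0, 200001)
dtheta = theta[1] - theta[0]

# Weighting function pi_n(theta) of eq. (35).
pi_n = (np.sqrt(n / (2 * np.pi * J)) / Z
        * np.exp(-n * (theta ** 2 / (2 * J) - np.log(2 * np.cosh(B + theta)))))

def mixture_prob(m):
    """Integral of pi_n(theta) * Q_theta(x_hat), eq. (33), for symbol sum m."""
    q_theta = np.exp((B + theta) * m) / (2 * np.cosh(B + theta)) ** n
    return float(np.sum(pi_n * q_theta) * dtheta)

# Compare the mixture form against the direct form Q = weight / Z.
max_rel_err = max(abs(mixture_prob(m) * Z / direct_weight(m) - 1.0) for m in sums)
total_mass = float(np.sum(pi_n) * dtheta)   # pi_n should integrate to one

print("max relative error:", max_rel_err)
print("integral of pi_n:  ", total_mass)
```

Since both the direct and the mixture forms are normalized, the weighting function itself should integrate to one, which the last line checks.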

Next, we repeat the earlier derivation for each $Q_\theta$ individually: 

$$
\begin{aligned}
Q\left\{ \sum_{i=1}^{n} d(x_i,\hat{X}_i) \le nD \right\}
 &= \int_{-\infty}^{+\infty} d\theta\, \pi_n(\theta)\, Q_\theta\left\{ \sum_{i=1}^{n} d(x_i,\hat{X}_i) \le nD \right\} \\
 &\le \int_{-\infty}^{+\infty} d\theta\, \pi_n(\theta)\, e^{-n R_\theta(D)},
\end{aligned} \tag{36}
$$

where $R_\theta(D)$ is a short-hand notation for $R_{Q_\theta}(D)$, which is well defined from the 
previous section since $Q_\theta$ is an i.i.d. distribution. At this point, two observations 
are in order: First, we observe that a separate large deviations analysis for each i.i.d. 
component $Q_\theta$ is better than applying a similar analysis directly to $Q$ itself, without the 
decomposition, since it allows a different optimum choice of $s$ for each $\theta$, rather than one 
optimization of $s$ that compromises all values of $\theta$. Moreover, since the upper bound 
is exponentially tight for each $Q_\theta$, the corresponding mixture of bounds is also 
exponentially tight. The second observation is that since $Q_\theta$ is i.i.d., $R_\theta(D)$ depends on 
the source $P$ only via the marginal distribution of a single symbol $P(x) = \Pr\{X_i = x\}$, 
which is assumed here to be independent of $i$. 

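A saddle-point analysis gives rise to the following expression for $R_Q(D)$, the 
random-coding rate-distortion function pertaining to $Q$, which is the large deviations 
rate function: 

$$
R_Q(D) = \min_{\theta} \left\{ \frac{\theta^{2}}{2J} - \ln[2\cosh(B+\theta)] + R_\theta(D) \right\} + \phi(B,J), \tag{37}
$$

where 

$$
\phi(B,J) = \lim_{n\to\infty} \frac{1}{n} \ln Z_n(B,J). \tag{38}
$$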

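We next have a closer look at $R_\theta(D)$, assuming $\mathcal{X} = \hat{\mathcal{X}} = \{-1,+1\}$, and using the 
Hamming distortion function, i.e., 

$$
d(x,\hat{x}) = \frac{1 - x \cdot \hat{x}}{2} =
\begin{cases}
0 & x = \hat{x} \\
1 & x \ne \hat{x}.
\end{cases} \tag{39}
$$

Since 

$$
\sum_{\hat{x}} Q_\theta(\hat{x})\, e^{s d(x,\hat{x})}
 = \sum_{\hat{x}} \frac{e^{(B+\theta)\hat{x}}}{2\cosh(B+\theta)}\, e^{s(1 - x\hat{x})/2}
 = \frac{e^{s/2} \cosh(B + \theta - sx/2)}{\cosh(B+\theta)}, \tag{40}
$$

we readily obtain 

$$
R_\theta(D) = \max_{s \le 0} \left[ s\left( D - \frac{1}{2} \right) - \sum_{x} P(x) \ln\cosh\left( B + \theta - \frac{sx}{2} \right) \right] + \ln\cosh(B+\theta). \tag{41}
$$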

On substituting this expression back into the expression of $R_Q(D)$, we obtain the formula 

$$
R_Q(D) = \min_{\theta} \left\{ \frac{\theta^{2}}{2J} + \max_{s \le 0} \left( s\left( D - \frac{1}{2} \right) - \sum_{x} P(x) \ln\left[ 2\cosh\left( B + \theta - \frac{sx}{2} \right) \right] \right) \right\} + \phi(B,J), \tag{42}
$$

which requires merely optimization over two parameters. In fact, the maximization over 
$s$, for a given $\theta$, can be carried out in closed form, as it boils down to the solution of a 
quadratic equation. Specifically, for a symmetric source ($P(-1) = P(+1) = 1/2$), the 
optimum value of $s$ is given by 

$$
s^{*} = \ln\left[ \sqrt{(1-2D)^{2} c^{2} + 4D(1-D)} - (1-2D)c \right] - \ln[2(1-D)], \tag{43}
$$

where 

$$
c = \cosh(2B + 2\theta). \tag{44}
$$

The details of the derivation of this expression are omitted as they are straightforward. 
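
The closed form (43) can be sanity-checked against a brute-force maximization of the bracketed expression in (42) for the symmetric source. The sketch below is an illustration under assumed values (D = 0.1, B = 0.2, θ = 0.3, and a search grid on [-10, 0]); none of these numbers come from the paper. It also assumes D is small enough that the stationary point is non-positive, so that the constraint s ≤ 0 is inactive.

```python
import numpy as np

def s_star(D, B, theta):
    """Closed-form maximizer of eq. (43) for the symmetric binary source."""
    c = np.cosh(2 * B + 2 * theta)                      # eq. (44)
    root = np.sqrt((1 - 2 * D) ** 2 * c ** 2 + 4 * D * (1 - D))
    return np.log(root - (1 - 2 * D) * c) - np.log(2 * (1 - D))

def objective(s, D, B, theta):
    """Inner expression of eq. (42) with P(-1) = P(+1) = 1/2."""
    a = B + theta
    return (s * (D - 0.5)
            - 0.5 * np.log(2 * np.cosh(a - s / 2))      # x = +1 term
            - 0.5 * np.log(2 * np.cosh(a + s / 2)))     # x = -1 term

# Hypothetical example values.
D, B, theta = 0.1, 0.2, 0.3

# Brute-force maximization over s <= 0 on a fine grid.
grid = np.linspace(-10.0, 0.0, 1_000_001)
s_grid = grid[np.argmax(objective(grid, D, B, theta))]
s_closed = s_star(D, B, theta)

print("grid maximizer:       ", s_grid)
print("closed-form maximizer:", s_closed)
```

Writing u = e^{s}, the stationarity condition of the objective reduces to the quadratic (1 - D)u^2 + c(1 - 2D)u - D = 0, whose positive root gives (43); this is the quadratic equation referred to in the text, as far as the sketch assumes.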
As the Curie-Weiss model is well known to exhibit phase transitions (see, e.g., 
[6], [13]), it is expected that $R_Q(D)$, under this model, would exhibit phase transitions 
as well. At the very least, the last term $\phi(B,J)$ is definitely subject to phase 
transitions in $B$ (the magnetic field) and $J$ (the coupling parameter). The first term, 
that contains the minimization over $\theta$, is somewhat more tricky to analyze in closed 
form. In essence, one considers $s^{*} = s^{*}(\theta)$ as a function of $\theta$, substitutes it back into 
the expression of $R_Q(D)$, and finally differentiates w.r.t. $\theta$ and equates to zero (in 
order to minimize). It then turns out that the (internal) derivative of $s^{*}(\theta)$ w.r.t. $\theta$ is 
multiplied by a vanishing expression (by the very definition of $s^{*}$ as a solution to the 
aforementioned quadratic equation). The final result of this manipulation is that the 
minimizing $\theta$ should be a solution to the equation 

$$
\theta = J \sum_{x} P(x) \tanh\left( B + \theta - \frac{s^{*}(\theta)\, x}{2} \right).
$$

This is a certain (rather complicated) variant of the well-known magnetization equation 
in the mean field model, $\theta = J\tanh(B + \theta)$, which is well known to exhibit a first order 
phase transition in $B$ whenever $J > J_c = 1$. It is therefore reasonable to expect that the 
former equation in $\theta$, which is more general, will also have phase transitions, at least in 
some cases. 
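
To illustrate the role of the critical coupling, the following sketch counts the solutions of the simpler mean field magnetization equation θ = J tanh(B + θ), not of its more complicated variant. The function name `fixed_points`, the search interval and the parameter values are assumptions made for illustration. Roots are located by sign changes of g(θ) = θ - J tanh(B + θ) on a grid, refined by linear interpolation; since every solution satisfies |θ| ≤ J, the default interval suffices for J up to 5.

```python
import numpy as np

def fixed_points(B, J, lo=-5.0, hi=5.0, num=200001):
    """Approximate solutions of theta = J * tanh(B + theta) on [lo, hi]."""
    theta = np.linspace(lo, hi, num)
    g = theta - J * np.tanh(B + theta)
    # Indices where g changes sign between consecutive grid points.
    idx = np.where(np.sign(g[:-1]) * np.sign(g[1:]) < 0)[0]
    # Linear interpolation of the zero crossing inside each bracketing cell.
    step = theta[idx + 1] - theta[idx]
    return theta[idx] - g[idx] * step / (g[idx + 1] - g[idx])

B = 0.05                                  # small hypothetical field
weak = fixed_points(B, J=0.5)             # below J_c = 1
strong = fixed_points(B, J=2.0)           # above J_c = 1

print("J = 0.5:", weak)
print("J = 2.0:", strong)
```

Below the critical coupling the equation has a single solution, whereas above it (and for a sufficiently weak field) three solutions coexist. The dominant saddle point then switches between the two outer solutions as B changes sign, which is the mechanism behind the first order transition in B mentioned above.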

5. Summary and Conclusion 

In this paper, we have drawn a conceptually simple analogy between lossy compression 
of memoryless sources and statistical mechanics of a system of non-interacting 
particles. Beyond the belief that this analogy may be interesting in its own right, we 
have demonstrated its usefulness at several levels. In particular, in the last section, 
we have observed that the analogy between the information-theoretic model and the 
physical model is not merely on the pure conceptual level, but moreover, analysis tools 
from statistical mechanics can be harnessed for deriving information-theoretic functions. 
Moreover, physical insights concerning phase transitions, in systems with strong 
interactions, can be 'imported' for understanding possible irregularities in these functions, 
in this case, non-smooth dependence on $B$ and $J$. 